Abstract

We present a mathematically rigorous Quality-of-Service (QoS) metric which relates the achievable
quality of service of a real-time analytics service to the server energy cost of offering the
service. Using a new iso-QoS evaluation methodology, we scale server resources to meet QoS targets
and directly rank the servers in terms of their energy-efficiency and, by extension, cost of ownership.
Our metric and method are platform-independent and enable fair comparison of datacenter compute
servers with significant architectural diversity, including micro-servers. We deploy our metric and
methodology to compare three servers running financial option pricing workloads on real-life market
data. We find that server ranking is sensitive to data inputs and desired QoS level and that although
scale-out micro-servers can be up to two times more energy-efficient than conventional heavyweight
servers for the same target QoS, they are still six times less energy-efficient than high-performance
computational accelerators.


1 Introduction 

Sustaining a defined Quality of Service (QoS) is an integral part of any Service Level Agreement (SLA) pertaining
to the provision of enterprise-level compute services. These compute services run in large datacenters. The key
business driver for the owners of these centers is the profit to be made by charging end users for the services
provided. QoS provision is an integral part of the owner profit and user cost model of datacenters and datacenter
services.

Emerging services providing real-time data analytics, such as trade and credit risk analytics in the capital
markets, incur a high usage and hosting premium. The reason is that the computational workloads of these
services are highly dynamic, event-driven, and demanding in terms of target real-time response latency, which
is often measured in microseconds. QoS provisioning for such services requires significant investments in server
and networking infrastructure, in addition to painstaking optimization of the service software.

A central question in provisioning hardware for real-time data analytics is the choice of compute server
architecture that will meet the latency targets of the service, while reducing the operational cost of the datacenter
and energy consumption in particular. The choice is challenging because of vast differences between servers in
architecture, price points, operational points, and target markets. As an example, the experimental campaign that
we conducted for this paper suggests that a given QoS target for real-time option pricing workloads on actual
market data feeds may be met by server hardware with power budgets ranging from 25W to over 200W and
latencies ranging by a factor of five. How does the datacenter owner choose the best server for low-latency, real-time
analytics workloads? Conversely, how does a user select the best equipped datacenter to run the same class of
workloads? This paper sets out to address these questions.

In this paper we present a new QoS metric for the fair ranking of servers that support real-time analytics
workloads with low latency requirements. The metric allows direct comparison between servers in terms of raw
performance and energy-efficiency, while equating the QoS that they provide to users. This leads to an iso-QoS
approach for ranking servers. We present a mathematically rigorous metric that accurately models dynamic
workloads with real-time event response deadlines and demonstrate that our metric fits real-life financial option
pricing workloads on actual market data well. The metric and its derivation are platform-agnostic and can be used
directly to optimize server provisioning for energy cost minimization under SLAs.

We mine data presented in previous papers [1, 2] to rank three servers in terms of iso-QoS under option pricing
workloads: a scale-out microserver based on Calxeda SoCs; a dual-socket Intel Sandy Bridge server; and an Intel
Xeon Phi server. Our experimental campaign uses option pricing workloads that we invested identical effort
in optimizing on each server. The campaign reveals new findings: the scale-out microserver can be up to two times
more energy-efficient than heavyweight servers under iso-QoS, but six times less energy-efficient than a
high-performance co-processor. Importantly, the relative ranking of servers varies with the option pricing algorithm
and the input to the algorithm, while changing server provisioning also produces counter-intuitive rankings.

The paper begins by briefly defining financial option contracts and their use in our real-time workloads in
Section 2. We move on to details of the platforms used and a summary of our experimental methodology in Section 3.
We present our mathematical model for QoS in Section 4 and apply an iso-QoS analysis to two option pricing kernels
to rank platforms in terms of energy efficiency. In Section 5 we discuss the results of our experimental campaign,
while in Section 6 we present related work in the field. Section 7 describes the NanoStreams project within the
context of which this work took place. The paper is concluded in Section 8.


2 Computing Option Prices 

A financial option is a contract giving the owner the right to either sell (Put) or buy (Call) a fixed number of
assets, frequently company stock, for a defined price on (European option) or before (American option) an end
date. Methods from stochastic calculus produce equations to model option prices by simulating multiple paths of
the underlying variables over a time window. Analytical solutions for these equations are not generally possible,
so a variety of computational numerical solution methods have been developed. We construct real-time analytics
workloads that continuously execute Monte Carlo (MC) or Binomial Tree (BT) option pricing models.

European vanilla options are a particular subset of option types. Black and Scholes [3, 4] proposed a
second-order partial differential equation which models the variation of an option price with contractual strike price
P, over time T years to contract expiry, assuming that the underlying asset spot price, S, follows a log-normal
distribution and that the volatility σ of S and the risk-free rate of return, r, are constant. An analytic solution to this
equation exists for European vanilla options but not generally for other types of options. Our work focuses on
European vanilla options because we can then use the Black-Scholes solution to provide a reference against which
to compare our code base and its generated numerical results for accuracy.

A rich literature already exists for both the MC and BT methods [5, 6]; therefore we present them only briefly
here. An MC simulation computes the current price of a Put contract by

$\mathrm{Price} = \frac{e^{-rT}}{N} \sum_{i=1}^{N} \max\left(0,\; P - S\, e^{(r - \sigma^{2}/2)T + \sigma\sqrt{T}\,x_i}\right)$    (1)

where x_i (i = 1, ..., N) is a set of random numbers drawn from the standard normal distribution. We generate these
using the 32-bit version of the Mersenne Twister algorithm [7] and the Box-Muller transformation. The BT pricing
model discretises the time to expiry, T in years, into a lattice of N + 1 levels with the root node as the current
underlying asset price S. Starting at the root, an up and a down factor are applied to generate two prices at the
next level. This continues, using the same constant factors, for all prices at all levels until the end level is reached.
The final stage of the algorithm works backwards over the lattice computing an expectation value for each price
at each level, finishing at the root node, which then contains the current option price.

Both algorithms depend on a parameter N and both converge non-monotonically to an exact answer in the
limit N → ∞. However, they have different computational characteristics. Generic MC is a classic "for" loop
summation, requiring evaluation of transcendental functions, and its operation count scales as O(N), while the BT
is dominated by a nested for-loop of add-multiply operations implementing the backward propagation step and
scaling as O(N^2).
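To make the contrast between the two kernels concrete, the following Python sketch prices a European vanilla put with both methods and checks them against the Black-Scholes closed form. It is an illustration only, not the OptionPricer code used in the experiments: the function names, the Cox-Ross-Rubinstein choice of up and down factors, and the parameter values in the usage note are assumptions made here.

```python
import math
import random


def black_scholes_put(S, P, r, sigma, T):
    """Closed-form Black-Scholes price of a European vanilla put (reference)."""
    d1 = (math.log(S / P) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return P * math.exp(-r * T) * cdf(-d2) - S * cdf(-d1)


def mc_put(S, P, r, sigma, T, n, seed=1):
    """Monte Carlo put price: a single summation loop, O(n) operations."""
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    total = 0.0
    for _ in range(n):
        u1 = 1.0 - rng.random()  # lies in (0, 1], so log() stays finite
        u2 = rng.random()
        # Box-Muller transformation to a standard normal sample
        x = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
        total += max(0.0, P - S * math.exp(drift + vol * x))
    return math.exp(-r * T) * total / n


def bt_put(S, P, r, sigma, T, n):
    """Binomial tree put price: nested multiply-add loop, O(n^2) operations."""
    dt = T / n
    up = math.exp(sigma * math.sqrt(dt))  # assumed CRR up factor
    down = 1.0 / up
    p = (math.exp(r * dt) - down) / (up - down)  # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Payoffs at the final level; node j has seen j up moves and n - j down moves.
    values = [max(0.0, P - S * up ** (2 * j - n)) for j in range(n + 1)]
    # Backward propagation towards the root node.
    for level in range(n, 0, -1):
        values = [disc * (p * values[j + 1] + (1.0 - p) * values[j])
                  for j in range(level)]
    return values[0]
```

For example, with the illustrative inputs S = 100, P = 100, r = 0.05, sigma = 0.2 and T = 1, both `mc_put(..., n=100000)` and `bt_put(..., n=500)` should land close to `black_scholes_put(...)`, which is the kind of accuracy check the Black-Scholes reference makes possible.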


3 Experimental Setup and Measurement Methodology 

Our experimental setup includes three platforms on which we execute our OptionPricer program and collect
workload-specific performance and energy metrics. This Section defines our metrics, describes the platforms
used and presents salient details of the methodology used to obtain the power readings and calculate the energy
consumption. A complete description of our methodology is available in [1].

3.1 Definition of Metrics 

Option pricing in finance takes place by consuming a live streaming data feed of stock market prices, often within 
the context of high frequency trading (HFT), and for pre-trade risk analytics. The execution time characteristics of 
option pricing are different from those of numerical simulation in computational science using HPC. By contrast to 
scientific codes which have measurable setup and post-processing phases, financial option pricing runs relatively 
small standalone kernels, such as MC and BT, at very high frequency with little set up and post processing work. 
Option pricing on live market data feeds is actually a form of event processing, where the event is the arrival of 
a price update on the underlying stock. Based on these distinctions we present and use three workload-specific 
metrics to compare servers under financial analytics workloads: 

QoS New prices may arrive at any time in a trading session. This means that any contracts not yet priced using 
the previous price update are abandoned and deemed unusable. Related to the Time/option metric below, but 
also dependent on market activity, we define the Quality of Service metric (QoS) as the ratio of successful to the 
total requested option price evaluations. The QoS metric is an application-specific measure on meeting option 
pricing performance requirements. It is useful for characterizing application-related performance and scalability 
offered by deploying multiple nodes. It is worth noting that QoS depends on the rate of stock price changes and 
other market activities at the time of its calculation, so it will be different each time it is calculated in a live market 
scenario. 

Joules/option (J/Opt or Jopt) The energy consumed per execution of a pricing kernel is a fundamental metric. In
the case of an actively traded stock, with a high number of defined option contracts, this building block is executed
repeatedly throughout the trading day. Correspondingly, a reduction in this value can result in significant energy
savings for providers offering option pricing services.

Time/option (S/Opt or Sopt) In contrast to providers, end users, particularly those engaged in HFT, are
sensitive to end-to-end latency, which constrains the elapsed time per option. This metric in turn can be
used to evaluate the total time to price all contracts for a given stock. Option pricing shares this time-to-solution
performance metric with HPC applications.
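As a rough illustration of the QoS definition above, the following Python sketch replays a list of price-update timestamps and counts how many option prices can be completed before the next update arrives. The function name, its arguments, and the simplification that each option takes exactly Sopt seconds are assumptions made for this example, not part of the measurement code.

```python
def replay_qos(update_times, session_end, n_options, s_opt):
    """Ratio of completed to requested option price evaluations.

    update_times: sorted arrival times (seconds) of price updates
    session_end:  time (seconds) at which the trading session closes
    n_options:    contracts to re-price after every update
    s_opt:        Time/option metric (seconds per option) for the platform
    """
    requested = 0
    completed = 0
    # Each update opens a window that closes when the next update arrives.
    boundaries = list(update_times[1:]) + [session_end]
    for start, end in zip(update_times, boundaries):
        gap = end - start
        requested += n_options
        # Contracts not priced before the next update are abandoned.
        completed += min(n_options, int(gap / s_opt))
    return completed / requested
```

With three updates at 0.0 s, 1.0 s and 1.5 s, a session end at 3.5 s, 10 options and Sopt = 0.125 s, the windows allow 8, 4 and 10 completed prices respectively, so the sketch returns 22/30. Replaying the same trace with a smaller Sopt raises the ratio, which is why QoS is tied to the Time/option metric as well as to market activity.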

3.2 Hardware Platforms 

We used three platforms: a state-of-the-art server architecture with Intel Sandy Bridge processors (referred
to as "Intel" in the rest of this paper), a state-of-the-art HPC architecture with an Intel Xeon Phi Knights
Corner coprocessor (referred to as "Xeon Phi"), and a Calxeda ECX-1000 microserver with ARM Cortex A9
processors, packaged in a Boston Viridis rack-mounted unit (referred to as "Viridis"). We used version 4.7.3 of
the GCC compiler and the Intel Compiler ICC version 14.0.0 20130728 for code generation, the latter only on Intel
platforms. The three platforms offer the possibility of scaling their frequency and voltage through a DVFS
interface. We conducted experiments only with the highest voltage-frequency settings on each platform, to which we
refer as performance mode. Previous work shows that performance mode is also the most energy efficient [1]. The
details of the platforms are as follows:

Intel is an x86-64 server with Sandy Bridge architecture, with 2 Intel Xeon CPU E5-2650 processors operating at
a frequency of 2.00 GHz and equipped with 8 cores each. The machine has 32 GB of DRAM (4 x 8 GB DDR3 @
1600 MHz). The server runs Linux CentOS 6.5 with kernel version 2.6.32 (2.6.32-431.17.1.el6.x86_64).

Xeon Phi (Knights Corner) is a many-core, x86-64 co-processor board (5110P model) attached over PCIe. It features the
many integrated cores (MIC) architecture which offers sixty 4-way hyperthreaded cores, each equipped with a
very wide (512-bit) vector unit. The board has more than 6 GB of GDDR5 DRAM, and the clock frequency is 1.053
GHz. High performance and high energy efficiency are the result of a highly parallel many-core design
running at low clock speeds. The system runs Linux kernel 2.6.38.8+mpss3.2.1.

Viridis is a 2U rack-mounted server containing sixteen microserver nodes connected internally by a high-speed
10 Gb Ethernet network. The platform appears logically as sixteen servers within one box. Each node is a Calxeda
EnergyCore ECX-1000 comprising 4 ARM Cortex A9 cores and 4 GB of DRAM running Ubuntu 12.04 LTS. Viridis
has a frequency of 1.4 GHz.

Note that when referring to the different platform settings later we use the following notation to represent the
platform configuration: [Nodes used x Cores used x Threads per core].

3.3 Software 

Starting from a common C code base, we created versions which use the vector units on each platform. We
achieved this in three different ways:

• creating assembler code implementations of hotspot loops

• using compiler intrinsic C functions which map to assembler instructions

• using the auto-vectorization functionality of the compiler.


Table 1: List of labels, VEC TYPE, defining the preparation of the executable binary

VEC TYPE     Description
AVX256       Assembler code using AVX 256-bit instructions on the Intel Sandy Bridge.
INTRINSICS   Compiler-supplied C functions on any platform (ARM 128-bit, Intel 256-bit, Xeon Phi 512-bit).
KNC512       Assembler code for the 512-bit vector instruction set on the Xeon Phi (Knights Corner).
NEON128      Assembler code for the ARM NEON 128-bit unit.
AUTOVECT     Compiler auto-vectorization on all platforms.


Table 1 defines the labels corresponding to the type of binary. Each experiment reported later in this paper is
conducted by executing one type of binary on one platform and is labeled accordingly.

3.4 Summary of Methodology 

For our experiments, we collected Facebook stock price ticks during a full New York Stock Exchange session and
replayed them using UDP multicast to all nodes in each of our platforms, as shown in Figure 1. This is as close as
an experiment needs to be to reality without any external glitches or factors affecting the setup or measurements.
Detection of a change in the Facebook stock price triggers computation of new prices for 617 Facebook European
options at the maximum speed feasible.


[Figure 1 diagram: stored trace data is replayed with tcpreplay over UDP multicast to each platform, whose blocks are labeled Kernel, libevent, Measuring and Pricing]

Figure 1: Financial trace data measurement setup 

Next we describe the power measurement methodology. The exact form of the current supply path to the
CPU differs from one platform to the next, but to provide a fair basis for comparison we identified two distinct
points on the path, shown in Figure 2, which are measurable on all platforms. We continuously monitored power
on each platform at these points during our experiments. To isolate the energy consumption of processor packages,
we capture power consumption at the point before the VRM, which we label PRE-VRM. For the Intel server,
PRE-VRM measurement is facilitated by reading the Running Average Power Limit (RAPL) counters, while the same
functionality on Viridis is available through the Intelligent Platform Management Interface (IPMI) counters, which
are also available on the Xeon Phi platform.



[Figure 2 diagram: current supply path with the measurement points PRE-PSU and PRE-VRM]


Figure 2: The path of the current supply to the CPU showing points at which we measured power. PSU 
is the power supply unit and VRM the voltage regulator module. 

Figure 3: CPU power vs. time for the MC kernel

Figure 3 shows the power versus time plot for a standalone execution of the MC kernel. The BT execution plot
is similar. The profile of instantaneous power versus time follows a very sharp trapezoidal shape: the CPU is fully
utilized during execution and there are no periods of inactivity. This is a common feature with other numerically
intensive HPC applications. It means that the measured average power is a representative measure of energy
consumption throughout kernel execution.
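The step from power readings to the Joules/option metric can be sketched as follows. This is a minimal Python illustration under the assumption that power is sampled at known timestamps; the function names and the sample values in the usage note are hypothetical and do not reproduce the RAPL or IPMI tooling.

```python
def energy_joules(times_s, power_w):
    """Integrate sampled power (watts) over time (seconds) with the trapezoid rule."""
    return sum(
        0.5 * (power_w[i] + power_w[i + 1]) * (times_s[i + 1] - times_s[i])
        for i in range(len(times_s) - 1)
    )


def joules_per_option(times_s, power_w, n_options):
    """Energy consumed during the run divided by the number of options priced."""
    return energy_joules(times_s, power_w) / n_options
```

For a flat profile, such as samples of 100 W taken every 0.5 s over 2 s, the integral equals the average power multiplied by the duration (200 J), which is why the average power is a representative measure for kernels with this trapezoidal profile.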


4 The Mathematical basis of the QoS Metric 

Many of the world's leading financial trading venues are order-driven markets, meaning that investors, especially
high frequency traders, submit buy and sell orders independently to matching engine software operating at high
speed at the venue. These engines cross buy and sell orders to create trades and are a key part of the electronic
trading platforms which underpin high frequency trading. Sequential models, which are the basis for analyzing
trading patterns in high frequency trading, assume a Poisson distribution to model the arrival of orders affecting
stock price into the system.

4.1 The QoS as a cumulative frequency distribution 

In this section we explain how we create a QoS curve as a function of price gap frequencies. It is important to note
that this curve is dictated solely by the market activity. In the next section we explain how we can determine, using
the Sopt and Jopt metrics for a given platform, whether we can meet a required QoS value or not.

From our data, we created a histogram of the distribution of time gaps between price updates for the Facebook
stock and from this computed a cumulative frequency distribution (CFD), which we noted exhibits the
characteristics of a Poisson CFD. This reflects the assumptions of the sequential model of financial trading.

Normally in a CFD the value assigned to bin i is the sum of all values in bins 1, ..., i. In our case these are
time bins so that the frequency is the number of price updates arriving at time intervals up to and including that
represented by bin i. There is a value of the time gap, depending on the performance of the platform, the number of
options to be priced and the kernel used, below which it is not possible to satisfy the hard constraint of computing
prices for all defined options. We denote this by G. Our QoS metric actually corresponds to the sum over all time
bins greater than this threshold. It follows that our QoS function is obtained by reflecting the initial CFD around
its mid-point on the time axis. This means that we can fit our observed time gap distribution to the form


$\mathrm{QoS}(t) = 1 - e^{-\lambda} \sum_{i=0}^{\lfloor t \rfloor} \frac{\lambda^{i}}{i!}$    (2)


Furthermore, we define the QoS, the y-axis, as a percentage rather than an absolute value.
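The construction of the QoS curve can be sketched in Python as follows. The sketch assumes that the observed time gaps are available as a plain list and that t is expressed in units of time bins; the function names are hypothetical, and the Poisson expression is the complement of the cumulative Poisson distribution to which the measured data are fitted, written out under that reading.

```python
import math


def empirical_qos(gaps, t):
    """Fraction of observed time gaps that are strictly longer than t."""
    return sum(1 for g in gaps if g > t) / len(gaps)


def poisson_qos(t, lam):
    """One minus the cumulative Poisson distribution evaluated at floor(t)."""
    k = math.floor(t)
    cdf = math.exp(-lam) * sum(lam ** i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf


def time_constraint(gaps, target):
    """Largest observed gap G such that at least `target` of all gaps are >= G."""
    ordered = sorted(gaps, reverse=True)
    needed = math.ceil(target * len(ordered))
    return ordered[needed - 1]
```

Given a target QoS expressed as a fraction, `time_constraint` reads the threshold G off the measured gaps, while `poisson_qos` gives the analytic curve for a chosen rate parameter. Both functions decrease as t grows: a longer required pricing window is satisfied by fewer of the observed gaps.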

The data for our experiments are taken from a trading session of 6.5 hours where 10,156 price updates occurred
for the Facebook (FB) stock, resulting in the cumulative distribution function representing the QoS shown in figure
4. The solid line shows the measured values joined directly by straight lines while the dashed curve shows the
result of fitting the measured data to the analytic expression for the cumulative Poisson distribution. Further
confirmation of the Poisson-like behavior of the arrival of price updates is seen in the profile for the Google stock,
which is also presented in figure 4. Similar price update profiles occur in work [8] studying prices on the German
DAX exchange.

Figure 4: Cumulative frequency distribution of Facebook and Google stock price updates for full trading
sessions on July 7th and 15th 2014


4.2 iso-QoS and total energy consumed 

Let us set a required QoS of Y% for all our platforms. From the QoS curve we can determine a minimum time
constraint, G, that we must satisfy. Within G seconds we need to compute all Nopt options defined on the stock.
First of all, a platform can only satisfy this constraint if

$G > N_{opt} \times S_{opt}$    (3)

Assuming this is met, we know that the energy consumed in each time gap is then

$E_{gap} = N_{opt} \times J_{opt}$    (4)

where we ignore idle power. Next, we know from the definition of QoS that the total number of time gaps in
which we will perform the computation is

$N_{gaps} = \mathrm{floor}(Y \times \text{total number of updates for the session})$    (5)

so that the energy consumed doing option pricing while meeting QoS Y% is

$E_{QoS=Y} = N_{gaps} \times E_{gap}$    (6)

Platforms may then be ranked, for this QoS, in order of energy consumption.
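The ranking procedure of this section can be sketched in a few lines of Python. The function and argument names are chosen here for illustration, and the value of G in the usage note is a hypothetical placeholder, since G has to be read from the QoS curve for the chosen target.

```python
import math


def iso_qos_energy_kj(j_opt, s_opt, n_opt, g, qos, n_updates):
    """Energy in kJ to meet the QoS target, or None if the time constraint fails.

    j_opt, s_opt: Joules/option and Time/option for a platform and binary
    n_opt:        number of options defined on the stock
    g:            time constraint G (seconds) for the target QoS
    qos:          target QoS as a fraction (0.10 for 10%)
    n_updates:    total number of price updates in the session
    """
    if not g > n_opt * s_opt:              # time constraint, equation (3)
        return None
    e_gap = n_opt * j_opt                  # energy per time gap, idle power ignored
    n_gaps = math.floor(qos * n_updates)   # gaps in which pricing is performed
    return n_gaps * e_gap / 1000.0         # total energy, converted from J to kJ


def rank_platforms(platforms, n_opt, g, qos, n_updates):
    """Sort the platforms that satisfy the constraint by ascending energy."""
    results = []
    for name, (j_opt, s_opt) in platforms.items():
        energy = iso_qos_energy_kj(j_opt, s_opt, n_opt, g, qos, n_updates)
        if energy is not None:
            results.append((energy, name))
    return [name for energy, name in sorted(results)]
```

As a check against the data in this paper, taking Jopt = 0.3830 and Sopt = 0.0038 with 617 options, 10,156 updates and a 10% QoS gives floor(1015.6) = 1015 gaps and 1015 x 617 x 0.3830 J, or about 239.86 kJ, provided the assumed G exceeds 617 x 0.0038 = 2.34 s. This agrees with the Viridis(16x4x1) entry reported for the MC kernel to within rounding.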

4.3 Application to platforms 

We have applied the equations defined above using the QoS curve in figure 4. Table 2 is the result of the analysis of
delivering option pricing with a 10% QoS using the MC kernel operated with 0.5M iterations. Only the five cases
(platform plus software) which can satisfy the constraint in equation (3) are reported. We noted that at 50% QoS
none of our platform/software combinations could satisfy the constraint in equation (3). We have commented on
this characteristic previously [2], explaining that it means only that a subset of all available options can be priced,
but not the full set. The MC kernel involves relatively expensive evaluation of the natural logarithm in the
Box-Muller transform and the exponential function to compute the option price.

Table 2: MC kernel (N=0.5M and QoS=10%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Viridis(16x4x1)     INTRINSICS   0.0038   0.3830   239.85
Intel(2x8x1)        AUTOVECT     0.0044   0.3794   237.58
Xeon Phi(1x60x1)    KNC512       0.0046   0.2234   139.92
Xeon Phi(1x60x2)    NOVECT       0.0036   0.1856   116.26
Xeon Phi(1x60x4)    INTRINSICS   0.0030   0.1584    99.19

We repeated the analysis with the BT kernel, which is dominated by multiply add operations, and report 
results for QoS values of 80% and 40% in tables 3-8. 


Table 3: BT kernel (N=4000 and QoS=80%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Intel(2x8x1)        AVX256       0.0007   0.0611   306.49
Viridis(16x4x1)     NEON128      0.0006   0.0603   302.41
Intel(1x8x1)        INTRINSICS   0.0013   0.0527   264.32
Xeon Phi(1x60x4)    INTRINSICS   0.0005   0.0131    65.88
Xeon Phi(1x60x2)    INTRINSICS   0.0004   0.0107    53.50
Xeon Phi(1x60x1)    INTRINSICS   0.0004   0.0092    46.27


Table 4: BT kernel (N=5000 and QoS=80%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Intel(2x8x1)        INTRINSICS   0.0015   0.1180   591.65
Intel(1x8x1)        INTRINSICS   0.0022   0.1017   509.69
Viridis(16x4x1)     INTRINSICS   0.0010   0.0912   457.05
Xeon Phi(1x60x1)    INTRINSICS   0.0006   0.0157    78.58
Xeon Phi(1x60x4)    INTRINSICS   0.0006   0.0152    76.23
Xeon Phi(1x60x2)    KNC512       0.0005   0.0139    69.76


Table 5: BT kernel (N=7000 and QoS=80%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Intel(2x8x1)        INTRINSICS   0.0032   0.3038   1522.85
Viridis(16x4x1)     INTRINSICS   0.0017   0.1679    841.83
Xeon Phi(1x60x2)    AUTOVECT     0.0007   0.0281    140.84
Xeon Phi(1x60x4)    INTRINSICS   0.0009   0.0275    138.02
Xeon Phi(1x60x1)    KNC512       0.0007   0.0216    108.28


Table 6: BT kernel (N=4000 and QoS=40%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Intel(2x8x1)        AVX256       0.0007   0.0611   153.24
Viridis(16x4x1)     NEON128      0.0006   0.0603   151.21
Intel(1x8x1)        INTRINSICS   0.0013   0.0527   132.16
Xeon Phi(1x60x4)    INTRINSICS   0.0005   0.0131    32.94
Xeon Phi(1x60x2)    INTRINSICS   0.0004   0.0107    26.75
Xeon Phi(1x60x1)    INTRINSICS   0.0004   0.0092    23.13


Table 7: BT kernel (N=5000 and QoS=40%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Intel(2x8x1)        INTRINSICS   0.0015   0.1180   295.82
Intel(1x8x1)        INTRINSICS   0.0022   0.1017   254.85
Viridis(16x4x1)     INTRINSICS   0.0010   0.0912   228.52
Xeon Phi(1x60x1)    INTRINSICS   0.0006   0.0157    39.29
Xeon Phi(1x60x4)    INTRINSICS   0.0006   0.0152    38.11
Xeon Phi(1x60x2)    KNC512       0.0005   0.0139    34.88


Table 8: BT kernel (N=7000 and QoS=40%)

Platform            VEC TYPE     S/Opt    J/Opt    Energy (kJ)
Intel(2x8x1)        INTRINSICS   0.0032   0.3038   761.42
Intel(1x8x1)        AVX256       0.0052   0.2526   632.95
Viridis(16x4x1)     INTRINSICS   0.0017   0.1679   420.92
Xeon Phi(1x60x2)    AUTOVECT     0.0007   0.0281    70.42
Xeon Phi(1x60x4)    INTRINSICS   0.0009   0.0275    69.01
Xeon Phi(1x60x1)    KNC512       0.0007   0.0216    54.14

In figure 5 we show how the energy of the scaled-out configurations varies with the number of points used. We
compare Viridis(16x4x1) to Intel(2x8x1) and show how the Viridis can actually outperform Intel's Sandy
Bridge while provisioning for an 80% QoS.



Figure 5: BT kernel energy consumption scaling (at QoS=80%) of Viridis(16x4x1) and Intel(2x8x1)


5 Discussion 

With a fixed number of options, the ability to satisfy the G constraint for a given QoS is inversely related to the Sopt
metric for the platform and software combination. The energy consumption, and therefore the ranking, is not only
proportional to the Jopt metric but also depends on Sopt, which determines the time needed to price a set of options.
In our work, the top of the ranking means the least energy consumption.

Across all the experiments, Xeon Phi is an excellent proposition for energy efficiency, ranking at the top. It
consumes from 2x to an order of magnitude less energy than any other platform in any iso-QoS comparison. This
is because Xeon Phi features a highly parallel and highly energy-efficient manycore architecture which matches
the parallelization and vectorization opportunities of the pricing kernels, especially BT. Interestingly, the energy
efficiency advantage of Xeon Phi over the other platforms grows as the QoS target rises and as the kernels perform
more iterations. This means that Xeon Phi energy efficiency scales better than that of any other platform.

Viridis, scaled out to 16 nodes, ranks from equivalent to up to 2x better than Intel across all experiments. A trend
is visible in the BT kernel results as the problem size increases. Specifically, the energy used by Viridis(16x4x1)
rises more slowly than that of Intel as the problem size grows. Indicatively, when N = 4000, regardless of the QoS
target, Viridis consumes almost the same energy as Intel. However, when N = 7000, Viridis uses approximately half the
energy of the Intel configurations.
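The trend can be read directly off the BT tables at QoS=80%. The short Python sketch below copies the reported energy values for Viridis(16x4x1) and Intel(2x8x1) and computes their ratio for each problem size; the variable names are chosen here for illustration.

```python
# Reported energy (kJ) for the BT kernel at QoS=80%, keyed by problem size N.
energy_kj = {
    4000: {"Viridis": 302.41, "Intel": 306.49},
    5000: {"Viridis": 457.05, "Intel": 591.65},
    7000: {"Viridis": 841.83, "Intel": 1522.85},
}

# Ratio below 1.0 means Viridis consumes less energy than Intel.
ratios = {n: e["Viridis"] / e["Intel"] for n, e in energy_kj.items()}

for n in sorted(ratios):
    print(f"N={n}: Viridis/Intel energy ratio = {ratios[n]:.2f}")
```

The ratio falls from roughly 0.99 at N = 4000 to 0.77 at N = 5000 and 0.55 at N = 7000, consistent with the observation that Viridis approaches half the energy of the dual-socket Intel configuration at the largest problem size.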

Focusing on the BT kernel experiments, it is interesting to note that the details of the Xeon Phi configurations which
rank at the top differ. Assuming a target QoS of 80%, when N = 4000, the BT kernel can be served most
efficiently by the Xeon Phi(1x60x1) INTRINSICS configuration. When N is increased to 5000 this configuration
is no longer the most energy efficient, being replaced by the Xeon Phi(1x60x2) KNC512. Most interestingly, when
moving to N = 7000, the Xeon Phi(1x60x1) KNC512 again becomes the most energy efficient. Although a higher
N indicates a heavier computational load, the single-thread-per-core Xeon Phi configuration has better energy
efficiency. This indicates that algorithmic input affects energy consumption in ways that are hard to predict and
provision for, and we leave this investigation as future work.

It is worth noting that none of the top performers involved compiler auto-vectorization. AUTOVECT binaries
are absent from most of the tables because these configurations were frequently unable to satisfy the G constraint.
Observing the tables, the AUTOVECT compiler approach may generate the lowest Sopt metrics but this does not
necessarily correspond to a low Jopt metric. This is because compiler optimizations target reducing execution time
but not energy consumption.

In addition to the use of the QoS metric to rank platforms fairly, the graph and its analytic fit, when combined
with values for the Sopt and Jopt metrics, allow dynamic predictions and modeling, which are of use to datacenter
managers for capacity planning exercises. There are a variety of costs involved in running a datacenter, but
simulations of energy consumption for a three-tier (3T) configuration report [9] that 70% of the energy is
consumed by the servers, of which 43%, the largest single component, is from CPUs (modeled as running at 130W).
Economic cost models distinguish variable cost from fixed cost. For example, the purchase and installation of
the platform represents the fixed cost. Our QoS metric addresses part of the so-called variable costs by targeting
the cost of the fundamental building block of the service provision, namely the timely computation of option
kernels. This allows predictive modeling of the economic option cost, which is associated with choosing to target
the requirements of one set of end user customers rather than another.


6 Related Work 

Recent related work that explores the performance and power consumption of servers based on low-power ARM
processors [10, 11] suggests that not all server workloads benefit from maximizing core counts and core frequencies,
thus pinpointing opportunities for energy-efficiency optimization. Our work supports these findings but
establishes a new metric and method for comparing servers fairly, whereby we equate the objective QoS and allow
server resource scaling in our comparisons, as opposed to equating hardware parameters such as hardware
feature sizes or core counts. The work of Blem et al. [12] studies the performance and power consumption of several
ARM and Intel processors but performs head-to-head comparisons of numerous performance and energy metrics,
instead of normalizing against one key metric, which is our approach.

Iso-metrics are common tools in parallel and distributed computing. Iso-efficiency [13] in terms of sustained
to theoretical maximum speedup has routinely been used to compare combinations of parallel algorithms and
architectures. Iso-energy-efficiency [14, 15] explores the influence of core scaling and frequency scaling on the
energy-efficiency of algorithms and architectures. We establish a new metric that caters to the needs of real-time
analytical workloads and emerging architectures that differ vastly in power budgets and form factors, and further
establish that the new metric is more appropriate for comparing server value propositions given modern hardware
diversity.

Also related to our work is prior research on improving the energy-efficiency of real-time financial workloads.
Schryver et al. [16] present a methodology for efficient design of hardware accelerators for option pricing, whereby
they cap the power consumption of the accelerator and the system as a whole. Morales et al. [17] propose an FPGA
design, programmable using OpenCL, to build energy-efficient versions of binomial option pricing algorithms.
They report a performance of 2,000 options/second, which is comparable to or lower than the performance attained
by our Xeon Phi and scaled-out Viridis implementations, but with a power budget of 20W, which is lower than
that of any of our platforms. Hardware optimization of our workloads is beyond the scope of this paper but
within the scope of our ongoing work in the NanoStreams project. The method presented in this paper fixes a
workload-centric QoS metric instead of a system-centric metric, while allowing flexibility in tuning both system
and workload parameters to meet the objective metric.


7 The NanoStreams Project 

The work reported in this paper has been carried out within the wider context of our NanoStreams project.
The project bridges the performance gap between microservers and large servers by enhancing microservers with
application-specific, energy-efficient and programmable accelerators. The project is building a heterogeneous
microserver with a host SoC and an analytics accelerator SoC, with a total power budget under 10 Watts, where a
performance-equivalent system with state-of-the-art server-class processors would consume about 170 Watts.

NanoStreams achieves its goals by adopting a scale-out approach where multiple microservers and sharable
accelerators are densely replicated and packaged to build systems with performance equivalent to that of large-scale
servers but a dramatically smaller form factor. A central feature of this is a co-designed software stack providing
elastic and scale-free co-execution of parallel workloads. NanoStreams uses processor-based FPGAs using
dataflow processing engines (nano-cores) and automatic C compiler generation technology to ease programming
of the heterogeneous micro-server. In this paper we have demonstrated that microservers are viable alternatives
for low-latency, real-time financial analytics, even if based on the now phased-out Calxeda ECX-1000 SoC and the
dated Cortex A9 core. We will be evaluating more recent ARM-based SoCs based on 64-bit cores with GPU and
FPGA accelerators in future work.

^(http://www.nanostreams.eu)


8 Conclusions 

In this paper we have presented a mathematical formulation of an application-driven QoS metric for the provision 
of financial option pricing services. This metric is a function of two workload-specific but architecture-agnostic 
metrics, seconds per option and Joules per option, plus several application parameters which define the numerical 
approximation computed. Notably, our study used real stock market streaming data and captured the dynamic,
event-driven nature of real-time financial analytics workloads. 

Our metric facilitated direct performance comparisons between server platforms with radically different
architectural operating points and price points. By defining a fixed QoS, a typical requirement for a service level
agreement between a datacenter provider and the end user, we have applied iso-QoS to rank different platforms
fairly, with a repeatable workload using real-life and real-time data. Our results reveal several interesting
findings. For example, a microserver with scaled-out nodes (Viridis 16x4x1) consumes significantly less energy
than a heavy-duty Intel Sandy Bridge server (2x8x1) for multiple QoS targets. When scaling out the number of
points for computations, the microserver consumes about half of the Intel server's energy.
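
The iso-QoS ranking summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in our experiments: the QoS target is simplified to an upper bound on seconds per option, the names `Config` and `iso_qos_ranking` are hypothetical, and the platform labels and numbers in the usage note are made up for illustration rather than taken from our measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One measured operating point of a platform (all fields hypothetical)."""
    platform: str
    nodes: int
    seconds_per_option: float
    joules_per_option: float


def iso_qos_ranking(configs, max_seconds_per_option):
    """Rank platforms by energy at a fixed QoS target.

    For each platform, keep the lowest-energy configuration whose
    seconds-per-option meets the target, then sort platforms by
    Joules per option (most energy-efficient first). Platforms with
    no configuration meeting the target are dropped.
    """
    best = {}
    for c in configs:
        if c.seconds_per_option > max_seconds_per_option:
            continue  # this operating point misses the QoS target
        current = best.get(c.platform)
        if current is None or c.joules_per_option < current.joules_per_option:
            best[c.platform] = c
    return sorted(best.values(), key=lambda c: c.joules_per_option)
```

For example, with illustrative operating points `Config("A", 1, 0.004, 0.9)`, `Config("A", 2, 0.002, 1.0)`, `Config("B", 4, 0.006, 0.3)`, `Config("B", 16, 0.002, 0.5)` and `Config("C", 1, 0.001, 0.1)`, a target of 0.003 seconds per option forces platforms A and B to scale out before they qualify, and the ranking is C, B, A. Tightening or relaxing the target can change which configuration represents each platform, which is why the ranking is sensitive to the desired QoS level.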

Our model directly benefits datacenter operators during hardware procurement and capacity planning
exercises, as it provides values which contribute to the economic option cost of providing service to one or other group
of end-users.

Our approach creates many avenues for future research. At its most fundamental, our method allows
evaluation of a QoS metric for any problem domain in which events, in the present case price updates, have to be
processed by intense compute kernels before the next event arrives. Thus the seconds per option metric would be
replaced more generally by a seconds per kernel metric, and similarly for the Joules per option metric. An alternative
direction of research is to incorporate the number of processors as a variable in the methodology and thus
dynamically provision the platforms to accommodate varying demand and a target QoS, while attempting to minimize
energy consumption. The metric can also be extended to cater for the provisioning of heterogeneous platforms.
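
As a rough illustration of the dynamic provisioning direction, the sketch below picks, for each event, the node count that meets the deadline at the lowest energy. It rests on assumptions that go beyond this paper: the deadline is taken to be the time until the next event, each node count has a profiled (seconds per kernel, Joules per kernel) pair, and the function name `provision` and all numbers in the usage note are hypothetical.

```python
def provision(profile, deadline_s):
    """Choose a node count for one event under a deadline.

    profile: dict mapping node count -> (seconds_per_kernel, joules_per_kernel),
             assumed to come from offline profiling (hypothetical data).
    deadline_s: time available before the next event arrives, in seconds.

    Returns the node count with the lowest Joules per kernel among those
    that finish within the deadline (ties go to fewer nodes), or None if
    no profiled node count is fast enough.
    """
    feasible = [
        (joules, nodes)
        for nodes, (seconds, joules) in profile.items()
        if seconds <= deadline_s
    ]
    if not feasible:
        return None
    return min(feasible)[1]
```

With an illustrative profile `{1: (0.008, 0.40), 2: (0.004, 0.44), 4: (0.002, 0.52)}`, deadlines of 0.010, 0.005 and 0.003 seconds select 1, 2 and 4 nodes respectively, and a deadline of 0.001 seconds returns `None` because no profiled configuration is fast enough. A practical provisioner would also have to account for the cost and latency of changing the node count, which this sketch ignores.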
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